Identifying unstable CNG repeat loci in the human genome: a heuristic approach and implications for neurological disorders

Tandem nucleotide repeat (TNR) expansions, particularly the CNG nucleotide configuration, are associated with a variety of neurodegenerative disorders. In this study, we aimed to identify novel unstable CNG repeat loci associated with the neurogenetic disorder spinocerebellar ataxia (SCA). Using a computational approach, 15,069 CNG repeat loci in the coding and noncoding regions of the human genome were identified. Based on the feature selection criteria (repeat length >10 and functional location of repeats), we selected 52 repeats for further analysis and evaluated the repeat length variability in 100 control subjects. A subset of 19 CNG loci observed to be highly variable in control subjects was selected for subsequent analysis in 100 individuals with SCA. The genes with these highly variable repeats also exhibited higher gene expression levels in the brain according to the tissue expression dataset (GTEx). No pathogenic expansion events were identified in patient samples, which is a limitation given the size of the patient group examined; however, these loci contain potential risk alleles for expandability. Recent studies have implicated GLS, RAI1, GIPC1, MED15, EP400, MEF2A, and CNKSR2 in neurological diseases, with GLS, GIPC1, MED15, RAI1, and MEF2A sharing the same repeat loci reported in this study. This finding validates the approach of evaluating repeat loci in different populations and their possible implications for human pathologies.


INTRODUCTION
Spinocerebellar ataxia (SCA) and other neuromuscular disorders are part of a group of neurodegenerative disorders that share a common disease mechanism of tandem nucleotide repeat expansions [1,2]. The available literature and various databases have identified CNG nucleotide repeats as the most prevalent cause of these neurodegenerative diseases, such as spinocerebellar ataxia [3-5]. Nearly 30-40% of ataxia cases can be explained by trinucleotide CNG repeat expansions [3-5]. Thus, identifying whether CNG expansions in other genes are a causal mechanism for unexplained cases of ataxia and other neuromuscular diseases is imperative.
Earlier gene discovery efforts based on classical gene mapping identified these CNG expansions as causal events. However, rapidly identifying novel CNG expansions at the cohort level is difficult because of both the rarity of the disease and its clinical heterogeneity. Additionally, these genomic regions are considered dark regions, which are inaccessible to traditional methods that rely on short-read sequencing data. The high cost of long-read sequencing, along with time constraints, does not permit a wider scope for identification.
In 2004, Pandey et al. tested a different approach and computationally reviewed the CAG repeats in the entire genome, identifying two CAG loci as putative candidates for SCA [6].
In this study, we used a combination of computational and genetic methods to identify possible disease-causing unstable repeat loci using a heuristic approach, which may serve as a cost-effective solution.

METHODOLOGY
Computational approach for TNR screening in the human genome
The FASTA sequences of individual chromosomes in the human reference genome (hg19) were downloaded from the UCSC genome browser (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/). The CNG repeat units associated with various neurological and neuromuscular disorders were selected from the literature for this analysis [1].
A program to find all possible repeats with a minimum of 4 contiguous repeat units was written in Python.
The methodology involved parsing a single whole-genome FASTA file. Using regular expressions, repetitive patterns of the specified repeat units within each chromosome were identified. Subsequently, a systematic iteration was conducted for each chromosome to detect potential repeat segments and their repeat complements; the output comprised the starting and ending positions relative to the beginning of the respective chromosome, the length of the repeat, and the repeat unit (source code is available at https://github.com/bharathramh/STR_repeat/blob/main/str.py). The positions were then functionally annotated using the offline version of ANNOVAR. We first downloaded the hg19 databases from ANNOVAR and then annotated the repeat regions using the table_annovar.pl command.
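The screening step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' str.py script: the names `Repeat`, `read_fasta`, and `find_repeats` are hypothetical, coordinates are assumed to be 1-based and inclusive, and the four CNG units are searched directly because that set already contains its own reverse complements (CAG/CTG and CCG/CGG).

```python
import re
from typing import Iterable, Iterator, NamedTuple, Tuple

# The four CNG units; the set is closed under reverse complementation
# (CAG <-> CTG, CCG <-> CGG), so repeats on both strands are covered.
CNG_UNITS = ("CAG", "CCG", "CGG", "CTG")


class Repeat(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    unit: str
    copies: int  # number of contiguous repeat units


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_repeats(chrom: str, seq: str, units=CNG_UNITS,
                 min_copies: int = 4) -> Iterator[Repeat]:
    """Report every run of at least `min_copies` contiguous copies of a unit."""
    seq = seq.upper()  # soft-masked (lowercase) bases are treated like any other
    for unit in units:
        # e.g. (?:CAG){4,} matches four or more back-to-back CAG units
        pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_copies))
        for m in pattern.finditer(seq):
            copies = (m.end() - m.start()) // len(unit)
            yield Repeat(chrom, m.start() + 1, m.end(), unit, copies)
```

Calling `find_repeats(name, seq)` for each record returned by `read_fasta(open(path))` would give one tuple per locus, which could then be written out as the start/end intervals that an annotation tool expects. The greedy quantifier means each run is reported once at its full length rather than as several overlapping shorter hits.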

Sample enrollment
A total of 100 patients with genetically uncharacterized SCA (retrospective + prospective cases) were enrolled. Patients exhibited autosomal dominant or X-linked inheritance and a sporadic late age of onset and were negative for SCA1, SCA2, SCA3, SCA6, SCA7, SCA8, SCA12, SCA17 and FRDA.
The control samples (N = 100) were obtained from the DNA repository of the Indian Genome Variation Consortium project [7]. We divided the analysis into two stages; the first stage focused on finding the unstable CNG sites in the genome, and the second stage investigated these unstable loci in patients with genetically uncharacterized SCA to identify disease-associated expansion-prone novel repeat loci.
Evaluation of repeat length variability at the selected loci
A total of 52 loci were targeted for CNG length estimation by polymerase chain reaction (PCR) using an M13-tagged forward primer, a reverse primer, and a fluorescently labeled M13-tagged primer.
For PCR amplification, the sample consisted of template human DNA (25 ng), PCR master mix (Epicentre's FailSafe mix or Promega master mix), 0.1 µl of forward primer (10 pM/µl), 0.4 µl of reverse primer (10 pM/µl), and 0.4 µl of M13-tagged primer (10 pM/µl) in a reaction volume of 10 µl. The PCR conditions were 95 °C for 3 min; 35 cycles of denaturation at 95 °C for 30 sec, annealing at 60 °C for 30 sec, and extension at 72 °C for 1 min; and a final extension at 72 °C for 5 min. The samples were analyzed using a fragment analyzer and visualized with GeneMapper software (version 4, Applied Biosystems).

RESULTS
Identification of CNG repeats from the human reference genome sequence
Through genome-wide CNG repeat selection, we found a total of 15,069 loci (≥4 contiguous repeats) (Fig. 1). The CNG repeats were abundant in the coding region and UTR. Overall, CAG and CTG repeats were most abundant across different genomic regions (Table 1). We annotated these repeat loci using ANNOVAR [8] and further categorized the tandem repeats based on the length observed in the reference genome: Group 1, 4-6 repeats; Group 2, 7-9 repeats; and Group 3, >9 repeats (Table 1).
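The length-based grouping can be expressed in a few lines. The sketch below is illustrative only: `repeat_group` and `tabulate` are hypothetical helper names, and the region labels in the usage note are placeholders for whatever functional categories the annotation step returns.

```python
from collections import Counter
from typing import Iterable, Tuple


def repeat_group(copies: int) -> str:
    """Map a reference repeat count to the three groups used in Table 1."""
    if copies < 4:
        raise ValueError("loci with fewer than 4 contiguous repeats are not screened")
    if copies <= 6:
        return "Group 1"   # 4-6 repeats
    if copies <= 9:
        return "Group 2"   # 7-9 repeats
    return "Group 3"       # >9 repeats


def tabulate(loci: Iterable[Tuple[str, str, int]]) -> Counter:
    """Count loci per (region, repeat unit, group).

    Each locus is given as (region label, repeat unit, repeat count).
    """
    return Counter((region, unit, repeat_group(copies))
                   for region, unit, copies in loci)
```

For example, `tabulate([("exonic", "CAG", 5), ("UTR5", "CGG", 12)])` would place the first locus in Group 1 and the second in Group 3, giving the kind of region-by-group breakdown summarized in Table 1.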
Using a reductionist approach for further analysis, we selected 52 loci located in the CDS region or UTR with a length of contiguous repeats ≥10 (Table 2 and Fig. 2). Repeats with more than 10 units are more prone to expansion events [9] and cause a decrease in the activity of flap endonuclease-1 (FEN1) on Okazaki fragments [10]. Furthermore, most pathogenic trinucleotide repeat expansions were observed in the coding region or UTR, for example, in SCA1-SCA3 (CAG expansion in the coding region), SCA12 (CAG expansion in the 5' UTR), and myotonic dystrophy (CTG expansion in the 3' UTR).
Genotyping of 52 CNG repeats in a control Indian population
By assessing the length variability of the 52 loci in control samples, 33 loci were found to be relatively stable (length variability of 1-6 repeat units), and 19 loci were more polymorphic in nature (length variability of 7-23 repeat units). These 19 more variable repeat loci (RAI1, UMAD1, GLS, HTR7P1, CNKSR2, MAML3, MED15, MLLT3, USF3, MEF2A, MIR205HG, NCOR2, RPL14, JPH3, MAB21L1, ANKUB1, ERF, GIPC1, and EP400) were further screened in our ataxia patient cohort to identify any length variation that might be pathogenic (Fig. 3). The MAB21L1, ANKUB1, and GLS genes were highly polymorphic and had a wide range of repeat distributions in the population [modes of repeats (ranges): 13 (8-26), 15 (8-33), and 12 (6-29), respectively]. The genes ANKUB1 and UMAD1 exhibited a large number of repeats (>30 repeats) in both the case and control groups. No significant difference in the large expansion range was observed between the case and control screenings (Table 2).
The heterozygosity indices (HIs, which measure the proportion of heterozygotes in the population) of UMAD1, MAB21L1, ANKUB1, GLS, and RPL14 were greater than 0.7 in both cases and controls. On the other hand, MLLT3 and CNKSR2 were less polymorphic and had more homozygous repeats (HI ≤ 0.1) in both groups. Most of the target loci fell within the range of 0.3 to 0.7, except for ERF, which had an HI of less than 0.25 in all samples.
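As an illustration of how such per-locus summaries might be derived from fragment-analysis genotypes, the sketch below computes the mode and range of repeat counts together with a heterozygosity index. It assumes that the HI is the observed fraction of individuals carrying two different allele lengths, which is one common reading and may differ from the exact calculation used here; the function names and the toy genotypes in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Genotype = Tuple[int, int]  # repeat counts of the two alleles of one individual


def heterozygosity_index(genotypes: Sequence[Genotype]) -> float:
    """Observed heterozygosity: share of individuals with two different alleles."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    heterozygotes = sum(1 for a, b in genotypes if a != b)
    return heterozygotes / len(genotypes)


def allele_summary(genotypes: Sequence[Genotype]) -> Dict[str, int]:
    """Mode and range of repeat counts across all alleles at one locus."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    alleles = [allele for pair in genotypes for allele in pair]
    counts = Counter(alleles)
    top = max(counts.values())
    # Ties for the most frequent allele are broken by taking the shortest one.
    mode = min(allele for allele, n in counts.items() if n == top)
    return {"mode": mode, "min": min(alleles), "max": max(alleles)}
```

With four toy genotypes such as `[(12, 12), (12, 15), (8, 12), (13, 13)]`, two of the four individuals are heterozygous, so the index is 0.5, and the summary reports a mode of 12 with a range of 8 to 15.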
Selection of unstable CNG repeats in the 1000 Genomes database
Since disease-associated tandem repeats tend to be more polymorphic in the general population, we investigated the polymorphic nature of these loci in the control population. Compared with different 1000 Genomes control populations, the mode of repeats and variability in the GLS gene were greater in the African and SAS populations (Table 3). MAB21L1 exhibited a greater repeat range in the EAS population. Although some of the other loci had a maximum of >20 repeat expansions, these loci were uniform or less variable within the populations. MEF2A was highly variable, ranging from 2 to 16 repeats, but it was uniform throughout the population. GIPC1 repeat variability was less common in the EUR population. For MED15 and ERF, repeat data were available for very few patient samples among different populations. We could not find any short tandem repeat data for the HTR7P1, RPL14, CNKSR2, or MLLT3 repeat loci. Our repeat data for the GLS, ANKUB1, EP400, JPH3, and RAI1 loci showed a biallelic distribution, which is also observed in other major populations.
Fig. 1 Outline of the study design. In silico selected tandem repeats were assessed for their instability and further screened in a patient population with spinocerebellar ataxia.
Interestingly, we observed variability in the repeat ranges of USF3, MEF2A, JPH3, RAI1, ERF, MED15, MAML3, and UMAD1 compared to those of other world populations, but none of the differences were significant according to the Wilcoxon signed-rank test (a nonparametric test). Both our groups had comparatively fewer repeats for the EP400 locus (Table 4). The probable reason for this difference is the use of different sequencing technologies; short-read sequencing was employed for the 1000 Genomes Project data. While short-read sequencing has its advantages, it is inherently less efficient at capturing long-range repeats and complex genomic regions.

Analysis of the expression levels of genes harboring unstable repeats
For all the candidate genes, the bulk tissue gene expression of each gene was compared among different tissues using GTEx [11]. The analysis showed that the CNKSR2, MAB21L1, USF3, RAI1, NCOR2, JPH3, MAML3, EP400, and GLS genes were expressed at significantly high levels in the brain, particularly in the cerebellum. All the other genes, except for MIR205HG, also exhibited significant expression levels in the brain (Table 2). Since the pathogenesis of SCA is associated with the brain, we excluded MIR205HG from the gene shortlist. Thus, we propose the remaining 18 genes as candidates in which repeat expansion might produce an ataxia phenotype.

DISCUSSION
Repeat instability is an underlying mutational mechanism for several neurodegenerative disorders in humans. Understanding how repeat instability leads to disease manifestation has always been challenging. Several distinct hypotheses on repeat expansion have been proposed over the years, but its mechanism is not fully understood [2,12-15]. Worldwide, repeat instability is the most prevalent genetic manifestation in spinocerebellar ataxia. Identifying the regions that undergo repeat expansion nevertheless remains difficult. In recent years, long-read next-generation sequencing has been an effective method for identifying these targets, but this method is costly and requires substantial infrastructure and highly trained personnel. Here, we used a cost-effective alternative approach for the investigation of tandem nucleotide repeats.
The initial phase of the study utilized a computational approach, yielding 52 suspected CNG repeat loci from various genes for further investigation in the Indian control population. A cost-effective fluorescent PCR-based fragment analysis approach then identified 19 highly polymorphic repeat targets after screening of the control samples.
Genetic markers for the same disorder have been shown to be expressed among various populations in diverse ways, with some diseases and genetic markers being population specific. Therefore, in the second phase of the study, we screened these putative candidates in patients with genetically uncharacterized clinically confirmed SCA. Although no large expansion of these target loci was identified in the study population, repeat polymorphisms in other populations of the 1000 Genomes Project were used as a proof of concept. We evaluated all identified unstable markers in different major populations and our control and patient samples to understand the population variability among these loci. We found repeat data for 15 of the 19 selected CNG loci in the 1000 Genomes STR database.
V. Suroliya et al.
Of the 18 identified highly unstable repeat loci, none exhibited large repeat expansions in our patient population. Multiple recent studies reporting point mutations and repeat expansions in genes from the proposed list of 18, across various neurological disorders, support the strategy adopted in this study [16-19]. Variation in the length of CAG repeats in the RAI1 gene is associated with differences in age at onset in spinocerebellar ataxia type 1 patients among various populations [16]. In 2019, Rad et al. reported that a point mutation in MAB21L1 causes a syndromic neurodevelopmental disorder with distinctive cerebellar, ocular, craniofacial, and genital features (COFG syndrome) [20].
Another study suggested that a point mutation in CNKSR2 is associated with seizures and mild intellectual disability [20]. In 2020, a report was published suggesting that frameshift mutations of GLI3, ANKUB1, and TAS2R3 might alter protein functions and accelerate the progression of polysyndactyly (PSD), an autosomal dominant genetic limb malformation [21]. The EP400 gene has been proposed to play a significant role in oligodendrocyte survival and myelination in the vertebrate central nervous system [22]. One study proposed that differences in the polyglutamine repeat length in MED15 change the expression of diverse stress pathways [17]. In various populations, CAG repeat variation in MEF2A is a risk factor for coronary artery disease (CAD) [18]. A study published in 2020 suggested that a CGG repeat expansion mutation in the 5' UTR of GIPC1 causes oculopharyngodistal myopathy (OPDM), an adult-onset inherited neuromuscular disorder [18]. A large GCA tandem expansion in the 5' UTR of the GLS gene causes overall developmental delay, progressive ataxia, and elevated levels of glutamine [19]. Reported studies of GLS, GIPC1, MED15, RAI1, and MEF2A included the same candidate loci that we identified in our study [7,13-15]. Although we did not identify any large repeat expansions, this previously reported evidence strengthens our study, indicating that our approach is in the right direction for the discovery of novel targets.
Limitations of the study
1. We considered only CNG repeats in the coding region and UTR with at least 10 contiguous repeats because of the large number of target loci. Considering other tri-, tetra-, penta-, and hexanucleotide repeat units and loci with lower repeat numbers would increase the chances of finding causal mutations.
2. We collected 100 patient samples for the study. SCA is a rare disorder, and its subtypes are very rare; therefore, a larger sample size will provide more confidence in our hypothesis.
3. Most SCA subtypes are geographic and population specific. In this study, we considered only North Indian SCA patient samples, and a multipopulation study could enhance the possibility of identifying causal mutations among the studied genes.
This study highlights the importance of the population polymorphism approach for understanding the genetic background and mechanism of tandem repeat instability in ataxia-like neurological disorders.The role of other repetitive sequences in both coding and noncoding regions in the context of neurological disorders can be explored with the help of computational and polymorphism approaches, as in this work.

CONCLUSION
Although our study did not positively identify any novel pathogenic CNG trinucleotide repeat expansions, it still describes an approach that utilizes population-level genomics data to address the complex genetic mechanisms underlying disease pathology.The list of novel unstable loci that we identified can be examined in other neurological and neuromuscular disease cohorts, and a larger sample size may lead to the discovery of pathogenic expansions at these loci.

Fig. 3 Distribution of target repeats among control and patient samples.

Fig. 2 Distribution of repeat categories within groups, showing the percentage of repeats per category, with colors representing different variables.

Table 1. Categorization of CNG repeat loci based on location and number of repeats in the reference genome.

Table 2. List of 52 selected loci and their repeat status in control samples (unstable loci are marked in bold).

Table 3. Features and characteristics of 19 polymorphic loci.